Abstract

In this paper we demonstrate a simple heuristic adaptive restart technique that can dra- 
matically improve the convergence rate of accelerated gradient schemes. The analysis of the 
technique relies on the observation that these schemes exhibit two modes of behavior depending 
on how much momentum is applied. In what we refer to as the 'high momentum' regime the 
iterates generated by an accelerated gradient scheme exhibit a periodic behavior, where the 
period is proportional to the square root of the local condition number of the objective function. 
This suggests a restart technique whereby we reset the momentum whenever we observe periodic 
behavior. We provide analysis to show that in many cases adaptively restarting allows us to 
recover the optimal rate of convergence with no prior knowledge of function parameters. 

1 Introduction 

Accelerated gradient schemes were first proposed by Yurii Nesterov in 1983 [14]. He demonstrated
a simple modification to gradient descent that could obtain provably optimal performance for 
the complexity class of first-order algorithms applied to minimize smooth convex functions. The 
method, and its successors, are often referred to as 'accelerated methods'. In recent years there has 
been a resurgence of interest in first-order optimization methods [19], [16], [21], [1], [12], driven primarily
by the need to solve very large problem instances unsuited to second-order methods. 

Accelerated gradient schemes can be thought of as momentum methods, in that the step taken 
at the current iteration depends on the previous iterations, and where the momentum grows from 
one iteration to the next. When we refer to restarting the algorithm we mean starting the algorithm 
again, taking the current iteration as the new starting point. This erases the memory of previous 
iterations and resets the momentum back to zero. 

Unlike gradient descent, accelerated methods are not guaranteed to be monotone in the objective 
value. A common observation when running an accelerated method is the appearance of ripples or 
bumps in the trace of the objective value; these are seemingly regular increases in the objective, 
see Figure 1 for an example. In this paper we demonstrate that this behavior occurs when the
momentum has exceeded a critical value (the optimal momentum value derived by Nesterov in
[15]) and that the period of these ripples is proportional to the square root of the (local) condition
number of the function. Separately, we show that the optimal restart interval is also proportional 
to the square root of the condition number. Combining these results we show that restarting when 
we observe an increase in the function value allows us to recover the optimal linear convergence rate 
in many cases. Indeed if the function is locally well-conditioned we can use restarting to obtain a 
linear convergence rate inside the well-conditioned region. 

Smooth unconstrained optimization. We wish to minimize a smooth convex function of a
variable $x \in \mathbf{R}^n$ [3],

$$\text{minimize} \quad f(x) \tag{1}$$

where $f : \mathbf{R}^n \to \mathbf{R}$ has a Lipschitz continuous gradient with constant $L$, i.e.,

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2, \quad \forall x, y \in \mathbf{R}^n.$$

We shall denote by $f^*$ the optimal value of the above optimization problem; if the minimizer exists
and is unique then we shall write it as $x^*$. Further, a function is said to be strongly convex if there
exists a $\mu > 0$ such that

$$f(x) \ge f^* + (\mu/2)\|x - x^*\|_2^2, \quad \forall x \in \mathbf{R}^n,$$

where $\mu$ is referred to as the strong convexity parameter. The condition number of a smooth,
strongly convex function is $L/\mu$.


2 Accelerated methods 

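Accelerated first-order methods to solve (1) were first developed by Nesterov [14]. This scheme is
from [15]:

Algorithm 1 Accelerated scheme I
Require: $x^0 \in \mathbf{R}^n$, $y^0 = x^0$, $\theta_0 = 1$ and $q \in [0, 1]$
1: for $k = 0, 1, \ldots$ do
2: $x^{k+1} = y^k - t_k \nabla f(y^k)$
3: $\theta_{k+1}$ solves $\theta_{k+1}^2 = (1 - \theta_{k+1})\theta_k^2 + q\theta_{k+1}$
4: $\beta_{k+1} = \theta_k(1 - \theta_k)/(\theta_k^2 + \theta_{k+1})$
5: $y^{k+1} = x^{k+1} + \beta_{k+1}(x^{k+1} - x^k)$
6: end for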



There are many variants of the above scheme, see, e.g., [21], [16], [12], [1], [2]. Note that by setting
$q = 1$ in the above scheme we recover gradient descent. For a smooth convex function the above
scheme converges for any $t_k \le 1/L$; setting $t_k = 1/L$ and $q = 0$ obtains a guaranteed convergence
rate of

$$f(x^k) - f^* \le \frac{4L\|x^0 - x^*\|_2^2}{(k+2)^2}. \tag{2}$$

If the function is also strongly convex with parameter $\mu$ then a choice of $q = \mu/L$ (the reciprocal
of the condition number) will achieve

$$f(x^k) - f^* \le L\left(1 - \sqrt{\mu/L}\right)^k \|x^0 - x^*\|_2^2. \tag{3}$$

This is often referred to as linear convergence. With this convergence rate we can achieve an
accuracy of $\epsilon$ in

$$O\left(\sqrt{L/\mu}\,\log(1/\epsilon)\right) \tag{4}$$

iterations.
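
As an illustration, Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the step size is held constant, and the function names `next_theta` and `accelerated_scheme` as well as the three-variable diagonal quadratic are hypothetical choices made here for the example.

```python
import math

def next_theta(theta, q):
    """Positive root of th^2 = (1 - th) * theta^2 + q * th (step 3 of Algorithm 1)."""
    b = theta * theta - q
    return (-b + math.sqrt(b * b + 4.0 * theta * theta)) / 2.0

def accelerated_scheme(grad, x0, step, q, iters):
    """Algorithm 1 with a constant step size; returns the list of iterates x^k."""
    x = list(x0)
    y = list(x0)
    theta = 1.0
    history = [list(x)]
    for _ in range(iters):
        g = grad(y)
        x_new = [yi - step * gi for yi, gi in zip(y, g)]          # gradient step from y
        theta_new = next_theta(theta, q)
        beta = theta * (1.0 - theta) / (theta * theta + theta_new)
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]   # momentum step
        x, theta = x_new, theta_new
        history.append(list(x))
    return history

# Hypothetical test problem: f(x) = (1/2) * sum_i d_i * x_i^2, so mu = 0.01 and L = 1.
d = [0.01, 0.1, 1.0]
f = lambda x: 0.5 * sum(di * xi * xi for di, xi in zip(d, x))
grad = lambda x: [di * xi for di, xi in zip(d, x)]

hist = accelerated_scheme(grad, [1.0, 1.0, 1.0], step=1.0, q=0.01, iters=200)  # q = mu / L
gd = accelerated_scheme(grad, [1.0, 1.0, 1.0], step=1.0, q=1.0, iters=200)     # q = 1
print(f(hist[-1]), f(gd[-1]))
```

With $q = 1$ the momentum coefficient is zero at every iteration, so the second run is plain gradient descent, which is the special case noted in the text.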

In the case of a strongly convex function the following simpler scheme obtains the same
guaranteed rate of convergence [15]:

Algorithm 2 Accelerated scheme II
Require: $x^0 \in \mathbf{R}^n$, $y^0 = x^0$
1: for $k = 0, 1, \ldots$ do
2: $x^{k+1} = y^k - (1/L)\nabla f(y^k)$
3: $y^{k+1} = x^{k+1} + \beta^*(x^{k+1} - x^k)$
4: end for

where we set

$$\beta^* = \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}}. \tag{5}$$

Note that in Algorithm 1, using the optimal choice $q = \mu/L$, we have that $\beta_k \uparrow \beta^*$. Taking $\beta_k$
to be a momentum parameter, then for a strongly convex function $\beta^*$ is the maximum amount
of momentum we should apply; when we have a value of $\beta$ higher than $\beta^*$ we refer to it as 'high
momentum'. We shall return to this point later.
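
The limit $\beta_k \uparrow \beta^*$ can be checked numerically. The sketch below iterates only the scalar recursions for $\theta_k$ and $\beta_k$ from Algorithm 1; the helper name `beta_sequence` and the value $\mu/L = 10^{-2}$ are assumptions made for the illustration.

```python
import math

def beta_sequence(q, n):
    """First n momentum coefficients beta_1..beta_n of Algorithm 1, starting from theta_0 = 1."""
    theta, betas = 1.0, []
    for _ in range(n):
        b = theta * theta - q
        theta_new = (-b + math.sqrt(b * b + 4.0 * theta * theta)) / 2.0
        betas.append(theta * (1.0 - theta) / (theta * theta + theta_new))
        theta = theta_new
    return betas

mu_over_L = 1e-2                       # assumed value of q = mu / L
s = math.sqrt(mu_over_L)
beta_star = (1.0 - s) / (1.0 + s)      # equation (5)
betas = beta_sequence(mu_over_L, 200)
print(betas[0], betas[-1], beta_star)
```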

The convergence of these schemes is optimal in the sense of the lower complexity bounds derived 
by Nemirovski and Yudin in [13]. However, this convergence is only guaranteed when the function
parameters $\mu$ and $L$ are known in advance.

2.1 Robustness 

A natural question to ask is how robust are accelerated methods to errors in the estimates of the
Lipschitz constant $L$ and strong convexity parameter $\mu$? For the case of an unknown Lipschitz
constant we can estimate the optimal step-size by the use of backtracking; see, e.g., [211 IH]- 
Estimating the strong convexity parameter is much more challenging. 

Estimating the strong convexity parameter. In [16] Nesterov demonstrated a method to
bound $\mu$, similar to the backtracking scheme for $L$ described above. His scheme achieves a convergence
rate quite a bit slower than Algorithm 1 with a known value of $\mu$. In practice, we often assume
or guess that $\mu$ is zero, which corresponds to setting $q = 0$ in Algorithm 1. Indeed many discussions
of accelerated algorithms do not even include a $q$ term; the original algorithm in [14] did not use
a $q$. However, this can dramatically slow down the convergence of the algorithm. Figure 1 shows
Algorithm 1 applied to minimize a positive definite quadratic function in $n = 200$ dimensions, with
optimal choice of $q$ being $q^* = \mu/L = 4.1 \times 10^{-5}$ (a condition number of about $2.4 \times 10^{4}$), and
step size $t = 1/L$. Each trace is the progress of the algorithm with a different choice of $q$ (hence a
different estimate of $\mu$).

We observe that slightly over or underestimating the optimal value of q for the function can 
have a severe detrimental effect on the rate of convergence of the algorithm. We also note the clear 
difference in behavior between the cases where we underestimate and where we overestimate $q^*$; in
the latter we observe monotonic convergence but in the former we notice the appearance of regular 
ripples or bumps in the traces. 

Figure 1: Convergence of Algorithm 1 with different estimates of $q$.



Interpretation. The optimal momentum depends on the condition number of the function; 
specifically, higher momentum is required when the function has a higher condition number. Under- 
estimating the amount of momentum required leads to slower convergence. However we are more 
often in the other regime, that of overestimated momentum, because generally $q = 0$, in which case
$\beta_k \uparrow 1$; this corresponds to high momentum and rippling behavior, as we see in Figure 1. This
can be visually understood in Figure 2, which shows the trajectories of sequences generated by
Algorithm 1 minimizing a positive definite quadratic in two dimensions, under $q = q^*$, the optimal
choice of $q$, and $q = 0$. The high momentum causes the trajectory to overshoot the minimum and
oscillate around it. This causes a rippling in the function values along the trajectory. Later we 
shall demonstrate that the period of these ripples is proportional to the square root of the (local) 
condition number of the function. 

Lastly we mention that the condition number is a global parameter; the sequence generated by 
an accelerated scheme may enter regions that are locally better conditioned, say, near the optimum. 
In these cases the choice of q = q* is appropriate outside of this region, but once we enter it we 
expect the rippling behavior associated with high momentum to emerge, despite the optimal choice 
of q. 

Figure 2: Sequence trajectories under Algorithm 1.

3 Restarting 
3.1 Fixed restart 

For strongly convex functions an alternative to choosing the optimal value of q is to use restarting, 
[16], [10]. One example of a fixed restart scheme is as follows:

Algorithm 3 Fixed restarting

Require: $x^0 \in \mathbf{R}^n$, $y^0 = x^0$, $\theta_0 = 1$
1: for $j = 0, 1, \ldots$ do
2: carry out Algorithm 1 with $q = 0$ for $k$ steps
3: set $x^0 = x^k$, $y^0 = x^k$ and $\theta_0 = 1$.
4: end for



We restart the algorithm every k iterations, taking as our starting point the last point produced 
by the algorithm, where $k$ is a fixed restart interval. In other words we 'forget' all previous iterations
and reset the momentum back to zero. 

Optimal fixed restart interval. We can obtain an upper bound on the optimal restart interval.
If we restart every $k$ iterations we have, at outer iteration $j$, inner loop iteration $k$ (just before a
restart),

$$f(x^{(j+1,0)}) - f^* = f(x^{(j,k)}) - f^* \le \frac{4L\|x^{(j,0)} - x^*\|_2^2}{k^2} \le \frac{8L}{\mu k^2}\left(f(x^{(j,0)}) - f^*\right),$$

where the first inequality is the convergence guarantee of Algorithm 1 and the second comes from
the strong convexity of $f$. So after $jk$ steps we have

$$f(x^{(j,0)}) - f^* \le \left(\frac{8L}{\mu k^2}\right)^j \left(f(x^{(0,0)}) - f^*\right).$$

If we assume we have $jk = c$ total iterations and we wish to minimize $(8L/(\mu k^2))^j$ over $j$ and $k$
jointly, we obtain

$$k^* = e\sqrt{8L/\mu}. \tag{6}$$

Using this as our restart interval we obtain an accuracy of $\epsilon$ in less than $O(\sqrt{L/\mu}\,\log(1/\epsilon))$ iterations,
i.e., the optimal linear convergence rate as in equation (4).
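
For completeness, the minimization behind the restart interval $e\sqrt{8L/\mu}$ can be worked out as follows. This is a short sketch that treats $j$ and $k$ as continuous variables and substitutes $j = c/k$; the shorthand $a = 8L/\mu$ is introduced here only for convenience.

```latex
\begin{aligned}
\log\left(\frac{a}{k^2}\right)^{c/k}
  &= \frac{c}{k}\left(\log a - 2\log k\right), \qquad a = \frac{8L}{\mu},\\
\frac{d}{dk}\left[\frac{c}{k}\left(\log a - 2\log k\right)\right]
  &= -\frac{c}{k^2}\left(\log a - 2\log k + 2\right) = 0
  \;\Longrightarrow\; k^* = e\sqrt{a} = e\sqrt{8L/\mu},\\
\left(\frac{a}{(k^*)^2}\right)^{c/k^*}
  &= \exp\left(-\frac{2c}{e\sqrt{8L/\mu}}\right) \le \epsilon
  \;\Longleftrightarrow\; c \ge e\sqrt{2L/\mu}\,\log(1/\epsilon).
\end{aligned}
```

The last line shows that each block of $k^*$ iterations contracts the bound by a factor $e^{-2}$, which gives the $O(\sqrt{L/\mu}\,\log(1/\epsilon))$ iteration count.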

The drawbacks in using fixed restarts are that firstly it depends on unknown parameters $L$ and,
more importantly, $\mu$, and secondly it is a global parameter that may be inappropriate in better
conditioned regions.

3.2 Adaptive restart 

The above analysis suggests that an adaptive restart technique may be useful. In particular we 
want a scheme that makes some computationally cheap observation and decides whether or not 
to restart based on that observation. In this paper we suggest two schemes that perform well in 
practice and provide some analysis to show accelerated convergence when these schemes are used. 

• Function scheme: we restart whenever

$$f(x^k) > f(x^{k-1}).$$

• Gradient scheme: we restart whenever

$$\nabla f(y^{k-1})^T (x^k - x^{k-1}) > 0.$$

Empirically we observe that these two schemes perform similarly well. The gradient scheme has 
two advantages over the function scheme. Firstly near to the optimum the gradient scheme may 
be more numerically stable. Secondly all quantities involved in the gradient scheme are already 
calculated in accelerated schemes, so no extra computation is required. 

We can give rough justifications for each scheme. The function scheme restarts at the bottom 
of the troughs as in Figure 1, thereby avoiding the wasted iterations where we are moving away
from the optimum. The gradient scheme restarts whenever the momentum term and the negative 
gradient are making an obtuse angle. In other words we restart when the momentum seems to be 
taking us in a bad direction, as measured by the negative gradient at that point. 
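
The two restart tests can be sketched on a small diagonal quadratic as follows. This is an illustrative sketch rather than the authors' code: the function name `accelerated`, the ten log-spaced eigenvalues, and the choice to restart from the most recent iterate are assumptions made here.

```python
import math

def accelerated(d, x0, iters, restart=None):
    """Algorithm 1 with q = 0 and t = 1/L on f(x) = (1/2) sum_i d_i x_i^2.

    restart is None, "function" or "gradient"; returns the objective values f(x^k).
    """
    L = max(d)
    f = lambda x: 0.5 * sum(di * xi * xi for di, xi in zip(d, x))
    x, y, theta = list(x0), list(x0), 1.0
    values = [f(x)]
    for _ in range(iters):
        g = [di * yi for di, yi in zip(d, y)]
        x_new = [yi - gi / L for yi, gi in zip(y, g)]
        theta_new = (-theta * theta + math.sqrt(theta ** 4 + 4.0 * theta * theta)) / 2.0
        beta = theta * (1.0 - theta) / (theta * theta + theta_new)
        if restart == "function":
            hit = f(x_new) > f(x)                      # objective went up
        elif restart == "gradient":
            hit = sum(gi * (xn - xo) for gi, xn, xo in zip(g, x_new, x)) > 0.0
        else:
            hit = False
        if hit:
            # Forget the momentum: continue from the new point with theta reset to 1.
            y, theta = list(x_new), 1.0
        else:
            y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
            theta = theta_new
        x = x_new
        values.append(f(x))
    return values

# Hypothetical instance: eigenvalues from 1 down to 1e-3 (condition number 1e3).
d = [10.0 ** (-3.0 * i / 9.0) for i in range(10)]
x0 = [1.0] * len(d)
plain = accelerated(d, x0, 1500)
fun = accelerated(d, x0, 1500, restart="function")
grd = accelerated(d, x0, 1500, restart="gradient")
print(plain[-1], fun[-1], grd[-1])
```

The gradient test reuses the gradient already computed for the step, whereas the function test needs objective evaluations, which mirrors the cost argument made above.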

Figure 3 shows the effect of different restart intervals on minimizing a positive definite quadratic
function in $n = 500$ dimensions. In this particular case the upper bound on the optimal restart
interval is every 700 iterations. We note that when this interval is used the convergence is better 
than when no restart is used, however not as good as using the optimal choice of q. We also 
note that restarting every 400 iterations performs about as well as restarting every 700 iterations, 
suggesting that the optimal restart interval is somewhat lower than 700. We have also plotted the 
performance of the two adaptive restart schemes. The performance is on the same order as the 
algorithm with the optimal q and much better than using the fixed restart interval. (Conjugate 
gradient methods, [11], will generally outperform an accelerated gradient scheme when minimizing 
a quadratic; we use quadratics here simply for illustrative purposes.) 

Figure 4 demonstrates the function restart scheme trajectories in the two dimensional example;
restarting resets the momentum and prevents the characteristic spiralling behavior.

Figure 4: Sequence trajectories under scheme I and with adaptive restart.

4 Analysis



In this section we consider applying an accelerated scheme to minimizing a positive definite quadratic. 
We shall see that once the momentum is larger than a critical value we observe periodicity in the 
iterates. We use this to prove linear convergence when using adaptive restarting. The analysis 
presented in this section is similar in spirit to the analysis of the heavy ball method in [18, §3.2].

4.1 Minimizing a quadratic 

Consider minimizing a convex quadratic. Without loss of generality we can assume that / has the 
following form: 

$$f(x) = (1/2)x^T A x,$$

where $A \in \mathbf{R}^{n \times n}$ is positive definite and symmetric. In this case $x^* = 0$ and $f^* = 0$. We have
strong convexity parameter $\mu = \lambda_{\min} > 0$ and $L = \lambda_{\max}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum
and maximum eigenvalues of $A$, respectively.

4.2 The algorithm as a linear dynamical system

We shall assume a fixed step-size $t = 1/L$ for simplicity. Given quantities $x^0$ and $y^0 = x^0$, Algorithm
1 is carried out as follows,

$$x^{k+1} = y^k - (1/L)Ay^k$$
$$y^{k+1} = x^{k+1} + \beta_k(x^{k+1} - x^k).$$

For the rest of the analysis we shall take $\beta_k$ to be constant and equal to some $\beta$ for all $k$. This
is a somewhat crude approximation, but by making it we can show that there are two regimes of
behavior of the system, depending on the value of $\beta$. Consider the eigenvector decomposition of
$A = V\Lambda V^T$. Denote by $w^k = V^T x^k$, $v^k = V^T y^k$. In this basis the update equations can be written

$$w^{k+1} = v^k - (1/L)\Lambda v^k$$
$$v^{k+1} = w^{k+1} + \beta(w^{k+1} - w^k).$$

These are $n$ independently evolving dynamical systems. The $i$th system evolves according to

$$w_i^{k+1} = v_i^k - (\lambda_i/L)v_i^k$$
$$v_i^{k+1} = w_i^{k+1} + \beta(w_i^{k+1} - w_i^k),$$

where $\lambda_i$ is the $i$th eigenvalue of $A$. Eliminating the sequence $v_i^k$ from the above we obtain the
following recurrence relation for the evolution of $w_i$:

$$w_i^{k+2} = (1 + \beta)(1 - \lambda_i/L)w_i^{k+1} - \beta(1 - \lambda_i/L)w_i^k, \quad k = 0, 1, \ldots,$$

where $w_i^0$ is known and $w_i^1 = w_i^0(1 - \lambda_i/L)$, i.e., a gradient step from $w_i^0$.

The update equation for $v_i$ is identical, differing only in the initial conditions,

$$v_i^{k+2} = (1 + \beta)(1 - \lambda_i/L)v_i^{k+1} - \beta(1 - \lambda_i/L)v_i^k, \quad k = 0, 1, \ldots,$$

where $v_i^0 = w_i^0$ and $v_i^1 = ((1 + \beta)(1 - \lambda_i/L) - \beta)v_i^0$.
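
The elimination step can be checked numerically for a single mode. The sketch below runs the coupled two-sequence iteration and the second-order recurrence side by side; the function names and the values $\lambda_i/L = 0.01$ and $\beta = 0.9$ are hypothetical choices for the illustration.

```python
def two_sequence(lam_over_L, beta, w0, n):
    """Iterate w^{k+1} = (1 - lam/L) v^k and v^{k+1} = w^{k+1} + beta (w^{k+1} - w^k)."""
    w, v, out = w0, w0, [w0]
    for _ in range(n):
        w_new = (1.0 - lam_over_L) * v
        v = w_new + beta * (w_new - w)
        w = w_new
        out.append(w)
    return out

def recurrence(lam_over_L, beta, w0, n):
    """Iterate the second-order recurrence for w directly, starting from w^0 and w^1."""
    a = 1.0 - lam_over_L
    out = [w0, a * w0]                  # w^1 is a gradient step from w^0
    for _ in range(n - 1):
        out.append((1.0 + beta) * a * out[-1] - beta * a * out[-2])
    return out

a_seq = two_sequence(0.01, 0.9, 1.0, 50)
b_seq = recurrence(0.01, 0.9, 1.0, 50)
print(max(abs(p - q) for p, q in zip(a_seq, b_seq)))
```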

4.3 Convergence properties

The behavior of this system is determined by the characteristic polynomial of the recurrence
relation,

$$r^2 - (1 + \beta)(1 - \lambda_i/L)r + \beta(1 - \lambda_i/L). \tag{7}$$

Let $\beta^*$ be the critical value of $\beta$ for which this polynomial has repeated roots, i.e.,

$$\beta^* = \frac{1 - \sqrt{\lambda_i/L}}{1 + \sqrt{\lambda_i/L}}.$$

If $\beta < \beta^*$ then the polynomial (7) has two real roots, $r_1$ and $r_2$, and the system evolves according
to [7]

$$w_i^k = c_1 r_1^k + c_2 r_2^k. \tag{8}$$

When $\beta = \beta^*$ the roots coincide at the point $r^* = (1 + \beta)(1 - \lambda_i/L)/2 = 1 - \sqrt{\lambda_i/L}$; this
corresponds to critical damping. We have the fastest monotone convergence at rate $\propto (1 - \sqrt{\lambda_i/L})^k$.
Note that if $\lambda_i = \mu$ then $\beta^*$ is the optimal choice of $\beta$ as given by equation (5) and the convergence
rate is the optimal rate, as given by equation (3). This is the case because, as we shall see, the
smallest eigenvalue will come to dominate the convergence of the entire system.

If /3 < /3* we are in the low momentum regime, and we say the system is over-damped. The 
convergence rate is dominated by the larger root, which is greater than $r^*$, i.e., the system exhibits
slow monotone convergence. 

If $\beta > \beta^*$ then the roots of the polynomial (7) are complex; we are in the high momentum
regime and the system is under-damped and exhibits periodicity. In that case the characteristic
solution is given by [7]

$$w_i^k = c_i\left(\beta(1 - \lambda_i/L)\right)^{k/2}\cos(k\psi_i - \delta_i)$$

where

$$\psi_i = \cos^{-1}\left(\frac{(1 - \lambda_i/L)(1 + \beta)}{2\sqrt{\beta(1 - \lambda_i/L)}}\right),$$

and $\delta_i$ and $c_i$ are constants that depend on the initial conditions; in particular for $\beta \approx 1$ we have
$\delta_i \approx 0$ and we will ignore it. Similarly,

$$v_i^k = \hat{c}_i\left(\beta(1 - \lambda_i/L)\right)^{k/2}\cos(k\psi_i - \hat{\delta}_i)$$

where $\hat{\delta}_i$ and $\hat{c}_i$ are constants, and again $\hat{\delta}_i \approx 0$. For small $\theta$ we know that $\cos^{-1}(\sqrt{1 - \theta}) \approx \sqrt{\theta}$,
and therefore if $\lambda_i \ll L$, then

$$\psi_i \approx \sqrt{\lambda_i/L}.$$

In particular the frequency of oscillation for the mode corresponding to the smallest eigenvalue $\mu$
is approximately given by $\psi_\mu \approx \sqrt{\mu/L}$.
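
The three regimes and the frequency approximation can be inspected directly from the roots of the characteristic polynomial. The following sketch uses Python's `cmath` module; the helper name `mode_roots` and the values $\lambda_i/L = 10^{-4}$ and $\beta = 0.9999$ are assumptions chosen for the illustration.

```python
import cmath
import math

def mode_roots(lam_over_L, beta):
    """Roots of r^2 - (1 + beta)(1 - lam/L) r + beta (1 - lam/L)."""
    a = 1.0 - lam_over_L
    b = (1.0 + beta) * a
    disc = cmath.sqrt(b * b - 4.0 * beta * a)   # imaginary when the mode is under-damped
    return (b + disc) / 2.0, (b - disc) / 2.0

lam_over_L = 1e-4                               # assumed ratio lambda_i / L
s = math.sqrt(lam_over_L)
beta_crit = (1.0 - s) / (1.0 + s)

r1, r2 = mode_roots(lam_over_L, beta_crit)      # critical damping: both roots near 1 - s
r_hi, _ = mode_roots(lam_over_L, 0.9999)        # beta above the critical value: complex pair
psi = cmath.phase(r_hi)                         # angle of the root = oscillation frequency
print(abs(r1 - (1.0 - s)), abs(r_hi), psi, s)
```

The modulus of the complex root equals $\sqrt{\beta(1 - \lambda_i/L)}$, which is the decay factor appearing in the under-damped solution.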

To summarize, based on the value of /3 we observe the following behaviors: 

• $\beta > \beta^*$: high momentum, under-damped

• $\beta < \beta^*$: low momentum, over-damped

• $\beta = \beta^*$: optimal momentum, critically damped.

4.4 Observable quantities

We don't observe the evolution of the modes, but we can observe the evolution of the function
value, which is given by

$$f(x^k) = \frac{1}{2}\sum_{i=1}^n \lambda_i (w_i^k)^2,$$

and if $\beta > \beta^* = (1 - \sqrt{\mu/L})/(1 + \sqrt{\mu/L})$ we are in the high momentum regime for all modes and
thus

$$f(x^k) = \frac{1}{2}\sum_{i=1}^n \lambda_i (w_i^k)^2 \approx \frac{1}{2}\sum_{i=1}^n \lambda_i (w_i^0)^2 \beta^k (1 - \lambda_i/L)^k \cos^2(k\psi_i).$$

The function value will quickly be dominated by the smallest eigenvalue and we have that

$$f(x^k) \approx \frac{1}{2}\mu (w_\mu^0)^2 \beta^k (1 - \mu/L)^k \cos^2\left(k\sqrt{\mu/L}\right), \tag{9}$$

where we have replaced $\psi_\mu$ with $\sqrt{\mu/L}$, and we are using the subscript $\mu$ to denote those quantities
corresponding to that mode. A similar analysis for the gradient restart scheme yields

$$\nabla f(y^k)^T(x^{k+1} - x^k) \approx \mu v_\mu^k (w_\mu^{k+1} - w_\mu^k) \propto \beta^k (1 - \mu/L)^k \sin\left(2k\sqrt{\mu/L}\right). \tag{10}$$

In other words observing the quantities in (9) or (10) we expect to see oscillations at a frequency
proportional to $\sqrt{\mu/L}$, i.e., the frequency of oscillation is telling us something about the condition
number of the function.

4.5 Convergence with adaptive restart 

If we apply Algorithm 1 with $q = 0$ to minimize a quadratic we start with $\beta_0 = 0$, i.e., the system
is in the low momentum, monotonic regime. Eventually $\beta_k$ becomes larger than $\beta^*$ and we enter
the high momentum, oscillatory regime. It takes about $(3/2)\sqrt{L/\mu}$ iterations for $\beta_k$ to exceed $\beta^*$.
After that the system is under-damped and the iterates obey equations (9) and (10). Under either
adaptive restart scheme, equations (9) and (10) indicate that we shall observe the restart condition
after a further $(\pi/2)\sqrt{L/\mu}$ iterations. We restart and the process begins again, with $\beta_k$ set back to
zero. Thus under either scheme we restart approximately every

$$\frac{\pi + 3}{2}\sqrt{\frac{L}{\mu}}$$

iterations (cf. the upper bound on optimal fixed restart interval (6)). Following a similar derivation
to §3.1 this restart interval guarantees us an accuracy of $\epsilon$ within $O(\sqrt{L/\mu}\,\log(1/\epsilon))$ iterations,
i.e., we have recovered the optimal linear convergence rate of equation (4) via adaptive restarting,
with no prior knowledge of $\mu$.
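
The first part of this estimate, the roughly $(3/2)\sqrt{L/\mu}$ iterations needed before $\beta_k$ exceeds $\beta^*$, can be checked by iterating the scalar recursion of Algorithm 1 with $q = 0$. The sketch below is only a numerical check; the function name `iterations_to_exceed` and the condition number $10^4$ are assumptions made for the example.

```python
import math

def iterations_to_exceed(mu_over_L):
    """Number of iterations of Algorithm 1 with q = 0 before beta_k exceeds beta*."""
    s = math.sqrt(mu_over_L)
    beta_star = (1.0 - s) / (1.0 + s)
    theta, k = 1.0, 0
    while True:
        theta_new = (-theta * theta + math.sqrt(theta ** 4 + 4.0 * theta * theta)) / 2.0
        beta = theta * (1.0 - theta) / (theta * theta + theta_new)
        k += 1
        if beta > beta_star:
            return k
        theta = theta_new

kappa = 1e4                                    # assumed condition number L / mu
k_cross = iterations_to_exceed(1.0 / kappa)
# Compare with (3/2) sqrt(kappa) and with the full restart interval ((pi + 3) / 2) sqrt(kappa).
print(k_cross, 1.5 * math.sqrt(kappa), (math.pi + 3.0) / 2.0 * math.sqrt(kappa))
```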

4.6 Extension to smooth convex minimization 

In many cases the function we are minimizing is well approximated by a quadratic near the optimum,
i.e., there is a region inside of which

$$f(x) \approx f(x^*) + (1/2)(x - x^*)^T \nabla^2 f(x^*)(x - x^*),$$

and loosely speaking we are minimizing a quadratic. Once we are inside this region we will observe
behavior consistent with the analysis above, and we can exploit this behavior to achieve fast
convergence by using restarts. Note that the Hessian at the optimum may have smallest eigenvalue
$\lambda_{\min} > \mu$, the global strong convexity parameter; in other words we can achieve a faster local
convergence than even if we had exact knowledge of the global parameter. This result is similar in
spirit to the restart method applied to the non-linear conjugate gradient method, where it is
desirable to restart the algorithm once it reaches a region in which the function is well approximated
by a quadratic [17, §5.2].

The effect of these restart schemes outside of the quadratic region is unclear. In practice we 
observe that restarting based on one of the criteria described above is almost always helpful, even far 
away from the optimum. However, we have observed cases where restarting far from the optimum 
can slow down the early convergence slightly, until the quadratic region is reached and the algorithm 
enters the rapid linear convergence phase. 

5 Numerical examples 

In this section we describe three further numerical examples that demonstrate the improvement of 
accelerated algorithms under an adaptive restarting technique. 

5.1 Log-sum-exp 

Here we minimize a smooth convex function that is not strongly convex. Consider the following 
optimization problem 

$$\text{minimize} \quad p\log\left(\sum_{i=1}^m \exp\left((a_i^T x - b_i)/p\right)\right)$$

where $x \in \mathbf{R}^n$. The objective function is smooth, but not strongly convex; it grows linearly
asymptotically. Thus, the optimal value of $q$ in Algorithm 1 is zero. The quantity $p$ controls the
smoothness of the function; as $p \to 0$, $f(x) \to \max_{i=1,\ldots,m}(a_i^T x - b_i)$. As it is smooth, we expect
the region around the optimum to be well approximated by a quadratic (we consider only examples 
where the optimal value is finite) , and thus we expect to eventually enter a region where our restart 
method will obtain linear convergence without any knowledge of where this region is, the size of the 
region or the local function parameters within this region. For smaller values of p the smoothness 
of the objective function decreases and thus we expect to take more iterations before we enter the 
region of linear convergence. 

As a particular example we took $n = 20$ and $m = 100$; we generated the $a_i$ and $b_i$ randomly.
Figure 5 demonstrates the performance of four different schemes for four different values of $p$. We
selected the step size for each case using backtracking. We note that both restart schemes perform 
well, eventually beating both gradient descent and the accelerated scheme. Both the function and 
gradient schemes eventually enter a region of fast linear convergence. For large p we see that even 
gradient descent performs well, as, similar to the restarted method, it is able to automatically 
exploit the local strong convexity of the quadratic region around the optimum. Notice also the 
appearance of the periodic behavior. 
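
Evaluating this objective for small $p$ needs some care, because the exponentials overflow easily. The sketch below shows one common way to compute the value and gradient stably by shifting with the largest exponent; the function name `log_sum_exp` and the tiny data set are hypothetical and are not taken from the experiments described here.

```python
import math

def log_sum_exp(x, A, b, p):
    """Value and gradient of f(x) = p * log(sum_i exp((a_i^T x - b_i) / p))."""
    z = [(sum(aij * xj for aij, xj in zip(ai, x)) - bi) / p for ai, bi in zip(A, b)]
    z_max = max(z)                        # shift by the maximum to avoid overflow
    e = [math.exp(zi - z_max) for zi in z]
    total = sum(e)
    value = p * (z_max + math.log(total))
    weights = [ei / total for ei in e]    # softmax weights, they sum to one
    grad = [sum(w * ai[j] for w, ai in zip(weights, A)) for j in range(len(x))]
    return value, grad

A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # hypothetical data with m = 3 and n = 2
b = [0.0, 0.0, 0.0]
value, grad = log_sum_exp([0.0, 0.0], A, b, p=0.5)
sharp_value, _ = log_sum_exp([1.0, 2.0], A, b, p=1e-3)   # close to max_i (a_i^T x - b_i)
print(value, grad, sharp_value)
```

The gradient is a convex combination of the rows $a_i$, so it is bounded, which is consistent with the linear asymptotic growth noted above.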

Figure 5: Minimizing a smooth but not strongly convex function; the black line is gradient
descent, the blue line is Algorithm 1, the red line is the function adaptive restart scheme,
the green line is the gradient adaptive restart scheme.



5.2 Sparse linear regression 

Consider the following optimization problem: 

$$\text{minimize} \quad (1/2)\|Ax - b\|_2^2 + p\|x\|_1, \tag{11}$$

over $x \in \mathbf{R}^n$, where $A \in \mathbf{R}^{m \times n}$ and in general $n \gg m$. This is a widely studied problem in the
field of compressed sensing, see e.g., [H [U O [20]. Loosely speaking problem ([TT]) seeks a sparse 
vector with a small measurement error. The quantity p trades off these two competing objectives. 
The iterative soft-threshold algorithm (ISTA) can be used to solve (11) [6], [8]. ISTA relies on the
soft-thresholding operator: 

$$T_\alpha(x) = \mathrm{sign}(x)\max(|x| - \alpha, 0),$$

where all the operations are applied elementwise. The ISTA algorithm, with constant step-size t, 
is given by 

Algorithm 4 ISTA

Require: $x^{(0)} \in \mathbf{R}^n$
1: for $k = 0, 1, \ldots$ do
2: $x^{k+1} = T_{pt}(x^k - tA^T(Ax^k - b))$.
3: end for



The convergence rate of ISTA is guaranteed to be at least $O(1/k)$, making it analogous to
gradient descent. 

The fast iterative soft thresholding algorithm (FISTA) was developed in [2]; a similar algo- 
rithm was also developed by Nesterov in [16]. FISTA essentially applies acceleration to the ISTA 
algorithm; it is carried out as follows, 

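Algorithm 5 FISTA

Require: $x^{(0)} \in \mathbf{R}^n$, $y^0 = x^0$ and $\theta_0 = 1$
1: for $k = 0, 1, \ldots$ do
2: $x^{k+1} = T_{pt}(y^k - tA^T(Ay^k - b))$
3: $\theta_{k+1} = (1 + \sqrt{1 + 4\theta_k^2})/2$
4: $\beta_{k+1} = (\theta_k - 1)/\theta_{k+1}$
5: $y^{k+1} = x^{k+1} + \beta_{k+1}(x^{k+1} - x^k)$.
6: end for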



For any choice of $t \le 1/\lambda_{\max}(A^T A)$ FISTA obtains a convergence rate of at least $O(1/k^2)$. The
objective in problem (11) is non-smooth, so it does not fit the class of problems we are considering
in this paper. However we are seeking a sparse solution vector x, and we note that once the non-zero 
basis of the solution has been identified we are essentially minimizing a quadratic. Thus we expect 
that after a certain number of iterations adaptive restarting can provide linear convergence. 

It is easy to show that the function adaptive restart scheme can be performed without an extra 
application of the matrix A, which is the costly operation in the algorithm. 

In performing FISTA we do not evaluate a gradient; however, FISTA can be thought of as a
generalized gradient scheme, in which we take

$$x^{k+1} = T_{pt}(y^k - tA^T(Ay^k - b)) := y^k - tG(y^k)$$

to be a generalized gradient step, where $G(y^k)$ is the generalized gradient at $y^k$. In this case the
gradient restart scheme amounts to restarting whenever

$$G(y^k)^T(x^{k+1} - x^k) > 0,$$

or equivalently

$$(y^k - x^{k+1})^T(x^{k+1} - x^k) > 0. \tag{12}$$

We generated data for the numerical instances as follows. Firstly the entries of A were sampled 
from a standard normal distribution. We then randomly generated a sparse vector y with n entries, 
only $s$ of which were non-zero. We then set $b = Ay + w$, where the entries in $w$ were IID sampled
from $\mathcal{N}(0, 0.1)$. This ensured that the solution vector $x^*$ is approximately $s$-sparse. We chose $p = 1$
and the step size $t = 1/\lambda_{\max}(A^T A)$. Figure 6 shows the dramatic speedup that adaptive restarting
can provide, for two different examples.
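
A small, dependency-free sketch of FISTA with the generalized gradient restart test is given below. It is illustrative only: the helper names, the problem sizes ($m = 10$, $n = 30$, $s = 3$), the noise standard deviation of 0.1, and the conservative step size $t = 1/\|A\|_F^2$ (which is at most $1/\lambda_{\max}(A^T A)$) are all assumptions made here and do not reproduce the experiments above.

```python
import random

def soft_threshold(x, a):
    """Elementwise T_a(x) = sign(x) * max(|x| - a, 0)."""
    return [(1.0 if xi > 0 else -1.0) * max(abs(xi) - a, 0.0) for xi in x]

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def rmatvec(A, r):
    """A^T r for A stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def objective(A, b, x, p):
    r = [ai - bi for ai, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(ri * ri for ri in r) + p * sum(abs(xi) for xi in x)

def fista(A, b, p, t, iters, restart=False):
    """FISTA (Algorithm 5); with restart=True the gradient restart test is applied."""
    n = len(A[0])
    x, y, theta = [0.0] * n, [0.0] * n, 1.0
    for _ in range(iters):
        r = [ai - bi for ai, bi in zip(matvec(A, y), b)]
        g = rmatvec(A, r)
        x_new = soft_threshold([yi - t * gi for yi, gi in zip(y, g)], p * t)
        theta_new = (1.0 + (1.0 + 4.0 * theta * theta) ** 0.5) / 2.0
        beta = (theta - 1.0) / theta_new
        # Restart test: (y^k - x^{k+1})^T (x^{k+1} - x^k) > 0.
        if restart and sum((yi - xn) * (xn - xo) for yi, xn, xo in zip(y, x_new, x)) > 0.0:
            y, theta = list(x_new), 1.0
        else:
            y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
            theta = theta_new
        x = x_new
    return x

random.seed(0)
m, n, s, p = 10, 30, 3, 1.0                      # hypothetical sizes
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y_true = [0.0] * n
for j in random.sample(range(n), s):
    y_true[j] = 1.0
b = [bi + random.gauss(0.0, 0.1) for bi in matvec(A, y_true)]
t = 1.0 / sum(aij * aij for row in A for aij in row)   # 1 / ||A||_F^2, a safe step size
x_plain = fista(A, b, p, t, 300)
x_restart = fista(A, b, p, t, 300, restart=True)
print(objective(A, b, x_plain, p), objective(A, b, x_restart, p))
```

The restart test only uses vectors that FISTA already forms, so it adds no extra multiplication by $A$.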







5.3 Quadratic programming 

Consider the following quadratic program, 

$$\begin{array}{ll} \text{minimize} & (1/2)x^T Q x + q^T x \\ \text{subject to} & a \le x \le b, \end{array} \tag{13}$$

over $x \in \mathbf{R}^n$, where $Q \in \mathbf{R}^{n \times n}$ is positive definite and $a, b \in \mathbf{R}^n$ are fixed vectors. The constraint
inequalities are to be interpreted element-wise, and we assume that $a < b$. We denote by $\Pi_C(z)$ the
projection of a point $z$ onto the constraint set, which amounts to thresholding the entries in $z$.
Projected gradient descent can solve (13); it is carried out as follows,

$$x^{k+1} = \Pi_C\left(x^k - t(Qx^k + q)\right).$$

Projected gradient descent obtains a guaranteed convergence rate of 0{l/k). Acceleration has been 
successfully applied to the projected gradient method [16], [2],

Algorithm 6 Accelerated projected gradient
Require: $x^0 \in \mathbf{R}^n$, $y^0 = x^0$ and $\theta_0 = 1$
1: for $k = 0, 1, \ldots$ do
2: $x^{k+1} = \Pi_C\left(y^k - t(Qy^k + q)\right)$
3: $\theta_{k+1}$ solves $\theta_{k+1}^2 = (1 - \theta_{k+1})\theta_k^2$
4: $\beta_{k+1} = \theta_k(1 - \theta_k)/(\theta_k^2 + \theta_{k+1})$
5: $y^{k+1} = x^{k+1} + \beta_{k+1}(x^{k+1} - x^k)$
6: end for

For any choice of $t \le 1/\lambda_{\max}(Q)$ accelerated projected gradient schemes obtain a convergence
rate of at least $O(1/k^2)$.

The presence of constraints makes this a non-smooth optimization problem; however, once the
constraints that are active have been identified the problem reduces to minimizing a quadratic on
a subset of the variables, and we expect adaptive restarting to increase the rate of convergence. As
in the sparse regression example we can use the generalized gradient in our gradient based restart
scheme, i.e., we restart based on condition (12).

As a final example, we set n = 500 and generate Q and q randomly; Q has a condition number 
of 10^. We take b to be the vector of all ones, and a to be that of all negative ones. For information, 
the solution to this problem has 70 active constraints. The step-size is set to $t = 1/\lambda_{\max}(Q)$ for all
algorithms. Figure [7] shows the performance of projected gradient descent, accelerated projected 
gradient descent, and the two restart techniques. 
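
For a box constraint the projection is a clip, so Algorithm 6 with the generalized gradient restart test is short to write down. The sketch below uses a hypothetical two-variable instance whose solution is known in closed form, $x^* = (1, -0.5)$ with the first upper bound active; the function names are illustrative and this is not the 500-variable experiment described above.

```python
def project_box(z, lo, hi):
    """Projection onto {x : lo <= x <= hi}: clip each entry."""
    return [min(max(zi, l), h) for zi, l, h in zip(z, lo, hi)]

def accelerated_projected_gradient(Q, q, lo, hi, t, iters, restart=True):
    """Algorithm 6 started from x = 0 (assumed feasible), optionally with restarts."""
    n = len(q)
    x, y, theta = [0.0] * n, [0.0] * n, 1.0
    for _ in range(iters):
        g = [sum(Qij * yj for Qij, yj in zip(row, y)) + qi for row, qi in zip(Q, q)]
        x_new = project_box([yi - t * gi for yi, gi in zip(y, g)], lo, hi)
        theta_new = (-theta * theta + (theta ** 4 + 4.0 * theta * theta) ** 0.5) / 2.0
        beta = theta * (1.0 - theta) / (theta * theta + theta_new)
        # Generalized gradient restart test: (y^k - x^{k+1})^T (x^{k+1} - x^k) > 0.
        if restart and sum((yi - xn) * (xn - xo) for yi, xn, xo in zip(y, x_new, x)) > 0.0:
            y, theta = list(x_new), 1.0
        else:
            y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
            theta = theta_new
        x = x_new
    return x

# Hypothetical instance: lambda_max(Q) = 2, so t = 1/2.
Q = [[2.0, 0.0], [0.0, 1.0]]
q = [-8.0, 0.5]
x = accelerated_projected_gradient(Q, q, [-1.0, -1.0], [1.0, 1.0], t=0.5, iters=200)
print(x)
```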

Figure 7: Adaptive restarting applied to the accelerated projected gradient algorithm.



6 Summary 

In this paper we have demonstrated a simple heuristic adaptive restart technique that can improve 
the convergence performance of accelerated gradient schemes for smooth convex optimization. We 
restart the algorithm whenever we observe a certain condition on the objective function value or 
gradient value. We provided some qualitative analysis to show that we can recover the optimal 
linear rate of convergence in many cases; in particular near the optimum of a smooth function we 
can potentially dramatically accelerate the rate of convergence, even if the function is not glob- 
ally strongly convex. We demonstrated the performance of the scheme on some simple numerical 
examples. 